Introduction to Solving Biological Problems Using R - Day 1

Mark Dunning and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts

Last modified: 12 Jun 2015

Day 1 Schedule

  1. Introduction to R and its environment
  2. Data Structures
  3. Data Analysis Example
  4. Plotting in R

1. Introduction to R and its environment

What’s R?

The R-project page

http://www.r-project.org/

R screenshot

R in the New York Times

http://goo.gl/pww4ZO

R new york time

R in Nature

R in nature

R plotting capabilities

http://spatial.ly/2012/02/great-maps-ggplot2/ R cycle journeys

R plotting capabilities

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919 R facebook

Who uses R? Not just academics!

http://www.revolutionanalytics.com/companies-using-r

Various platforms supported

Getting started

Launching R Using RStudio

To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click

RStudio screenshot

The Working Directory (wd)

setwd("R_course/Day_1_scripts")

Basic concepts in R - command line calculation

2 + 2
## [1] 4
20/5 - sqrt(25) + 3^2
## [1] 8
sin(pi/2)
## [1] 1

Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…

Basic concepts in R - variables

x <- 10
x
## [1] 10
myNumber <- 25
myNumber
## [1] 25
sqrt(myNumber)
## [1] 5
x + myNumber
## [1] 35

Basic concepts in R - variables

x <- 21
x
## [1] 21
x <- myNumber
x
## [1] 25
myNumber <- myNumber + sqrt(16)
myNumber
## [1] 29

Basic concepts in R - functions

sin(x)

this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.

sum(3,4,5,6)
## [1] 18
max(3,4,5,6)
## [1] 6
min(3,4,5,6)
## [1] 3

Basic concepts in R - functions

seq(from = 2, to = 20, by = 4)
## [1]  2  6 10 14 18
seq(2, 20, 4)
## [1]  2  6 10 14 18

Basic concepts in R - vectors

x <- c(3,4,5,6)
x
## [1] 3 4 5 6
x[1]
## [1] 3
x[4]
## [1] 6
y <- c(2,3)
x[y]
## [1] 4 5

Basic concepts in R - vectors

x <- c(3,4,5,6,7,8,9,10,11,12)
x <- 3:12
x
##  [1]  3  4  5  6  7  8  9 10 11 12
x <- seq(2, 20, 4)
x
## [1]  2  6 10 14 18
x <- seq(2, 20, length.out=5)
x
## [1]  2.0  6.5 11.0 15.5 20.0

Basic concepts in R - vectors

y <- rep(3, 5)
y
## [1] 3 3 3 3 3
y <- rep(1:3, 5)
y
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Basic concepts in R - vectors

x <- 3:12
x[3:7]
## [1] 5 6 7 8 9
x[seq(2, 6, 2)]
## [1] 4 6 8
x[rep(3, 2)]
## [1] 5 5

Basic concepts in R - vectors

y <- c(x, 1)
y
##  [1]  3  4  5  6  7  8  9 10 11 12  1
z <- c(x, y)
z
##  [1]  3  4  5  6  7  8  9 10 11 12  3  4  5  6  7  8  9 10
## [19] 11 12  1

Basic concepts in R - vectors

x <- 3:12
x[-3]
## [1]  3  4  6  7  8  9 10 11 12
x[-(5:7)]
## [1]  3  4  5  6 10 11 12
x[-seq(2, 6, 2)]
## [1]  3  5  7  9 10 11 12

Basic concepts in R - vectors

x[6] <- 4
x
##  [1]  3  4  5  6  7  4  9 10 11 12
x[3:5] <- 1
x
##  [1]  3  4  1  1  1  4  9 10 11 12

Remember! Square brackets for indexing [], parentheses for function arguments ().

Basic concepts in R - vector arithmetic

x <- 1:10
y <- x*2
y
##  [1]  2  4  6  8 10 12 14 16 18 20
z <- x^2
z
##  [1]   1   4   9  16  25  36  49  64  81 100

Basic concepts in R - vector arithmetic

y + z
##  [1]   3   8  15  24  35  48  63  80  99 120
x + 1:2
##  [1]  2  4  4  6  6  8  8 10 10 12
x + 1:3
## Warning in x + 1:3: longer object length is not a
## multiple of shorter object length
##  [1]  2  4  6  5  7  9  8 10 12 11

Basic concepts in R - Character vectors and naming

gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
gene.expression <- c(0, 3.2, 1.2, -2)
gene.expression
## [1]  0.0  3.2  1.2 -2.0
names(gene.expression) <- gene.names
gene.expression
##       Pax6 Beta-actin      FoxP2       Hox9 
##        0.0        3.2        1.2       -2.0
names(gene.expression)
## [1] "Pax6"       "Beta-actin" "FoxP2"     
## [4] "Hox9"

Exercise: genes and genomes

Species Genome size (Mb) Protein coding genes
Homo sapiens 3,102 20,774
Mus musculus 2,731 23,139
Drosophila melanogaster 169 13,937
Caenorhabditis elegans 100 20,532
Saccharomyces cerevisiae 12 6,692

Exercise: genes and genomes

Answers to genome exercise

genome.size  <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
                  "D. melanogaster","C. elegans",
                  "S. cerevisiae")
names(genome.size)  <- species.name
names(coding.genes) <- species.name
coding.bases <- coding.genes*0.0015
coding.bases
##      H. sapiens     M. musculus D. melanogaster 
##         31.1610         34.7085         20.9055 
##      C. elegans   S. cerevisiae 
##         30.7980         10.0380

Answers to genome exercise

coding.pc <- coding.bases/genome.size*100
coding.pc
##      H. sapiens     M. musculus D. melanogaster 
##        1.004545        1.270908       12.370118 
##      C. elegans   S. cerevisiae 
##       30.798000       83.650000
coding.bases[1]/coding.bases[5]
## H. sapiens 
##   3.104304
genome.size[1]/genome.size[5]
## H. sapiens 
##      258.5
names(coding.pc) <- NULL
coding.pc
## [1]  1.004545  1.270908 12.370118 30.798000
## [5] 83.650000

Writing scripts with RStudio

Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.

x <- 2 + 2
print(x)

Sourcing can also be performed manually with source("myScript.R")

Getting help

?seq
example(seq)
??plot

Interacting with the R console

R packages

source("http://bioconductor.org/biocLite.R")
biocLite("PackageName")

Exercise: Install packages ggplot2 and DESeq

Exercise: Install packages ggplot2 and DESeq

install.packages("ggplot2")
source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
library(ggplot2) # loads ggplot functions
library(DESeq)   # loads DESeq functions
library()        # Lists all the packages you've got installed 

2. Data structures

R is designed to handle experimental data

The patients data frame

First_Name Second_Name Full_Name Sex Age Weight Consent
1 Adam Jones Adam Jones Male 50 70.8 TRUE
2 Eve Parker Eve Parker Female 21 67.9 TRUE
3 John Evans John Evans Male 35 75.3 FALSE
4 Mary Davis Mary Davis Female 45 61.9 TRUE
5 Peter Baker Peter Baker Male 28 72.4 FALSE
6 Paul Daniels Paul Daniels Male 31 69.9 FALSE
7 Joanna Edwards Joanna Edwards Female 42 63.5 FALSE
8 Matthew Smith Matthew Smith Male 33 71.5 TRUE
9 David Roberts David Roberts Male 57 73.2 FALSE
10 Sally Wilson Sally Wilson Female 62 64.8 TRUE

Character, numeric and logical data types

age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5, 
71.5, 73.2, 64.8)
firstName <- c("Adam", "Eve", "John", "Mary", "Peter", 
"Paul", "Joanna", "Matthew", "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis", 
"Baker","Daniels", "Edwards", "Smith", "Roberts", "Wilson")
consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, TRUE)

Character, numeric and logical data types

c(20, "a string", TRUE)
## [1] "20"       "a string" "TRUE"
 class(firstName)
## [1] "character"
 class(age)
## [1] "numeric"
 class(weight)
## [1] "numeric"
 class(consent)
## [1] "logical"

Factors

sex <- c("Male", "Female", "Male", "Female", "Male", "Male",
"Female", "Male", "Male", "Female")
sex
##  [1] "Male"   "Female" "Male"   "Female" "Male"   "Male"  
##  [7] "Female" "Male"   "Male"   "Female"
factor(sex)
##  [1] Male   Female Male   Female Male   Male   Female Male  
##  [9] Male   Female
## Levels: Female Male

Creating a data frame (first attempt)

patients <- data.frame(firstName, secondName, 
                       paste(firstName, secondName),  
                       sex, age, weight, consent)
patients
##    firstName secondName paste.firstName..secondName.    sex
## 1       Adam      Jones                   Adam Jones   Male
## 2        Eve     Parker                   Eve Parker Female
## 3       John      Evans                   John Evans   Male
## 4       Mary      Davis                   Mary Davis Female
## 5      Peter      Baker                  Peter Baker   Male
## 6       Paul    Daniels                 Paul Daniels   Male
## 7     Joanna    Edwards               Joanna Edwards Female
## 8    Matthew      Smith                Matthew Smith   Male
## 9      David    Roberts                David Roberts   Male
## 10     Sally     Wilson                 Sally Wilson Female
##    age weight consent
## 1   50   70.8    TRUE
## 2   21   67.9    TRUE
## 3   35   75.3   FALSE
## 4   45   61.9    TRUE
## 5   28   72.4   FALSE
## 6   31   69.9   FALSE
## 7   42   63.5   FALSE
## 8   33   71.5    TRUE
## 9   57   73.2   FALSE
## 10  62   64.8    TRUE

Naming data frame variables

patients$age
##  [1] 50 21 35 45 28 31 42 33 57 62
names(patients) <- c("First_Name", "Second_Name", 
"Full_Name", "Sex", "Age", "Weight", "Consent")
names(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"

Naming data frame variables

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,secondName), 
                       Sex = sex,
                       Age = age,
                       Weight = weight, 
                       Consent = consent)
names(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"

Factors in data frames

patients$First_Name
##  [1] Adam    Eve     John    Mary   
##  [5] Peter   Paul    Joanna  Matthew
##  [9] David   Sally  
## 10 Levels: Adam David Eve ... Sally

Factors in data frames

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName, secondName), 
                       Sex = factor(sex),
                       Age = age,
                       Weight = weight,
                       Consent = consent,
                       stringsAsFactors = FALSE)
 patients$Sex
##  [1] Male   Female Male   Female Male  
##  [6] Male   Female Male   Male   Female
## Levels: Female Male
 patients$First_Name
##  [1] "Adam"    "Eve"     "John"   
##  [4] "Mary"    "Peter"   "Paul"   
##  [7] "Joanna"  "Matthew" "David"  
## [10] "Sally"

Matrices

e <- matrix(1:10, nrow=5, ncol=2)
e
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
f <- matrix(1:10, nrow=2, ncol=5)
f
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
f %*% e
##      [,1] [,2]
## [1,]   95  220
## [2,]  110  260

The %*% operator is the matrix multiplication operator, not the standard multiplication operator.

Lists

one.to.ten <- 1:10
uniform.mat <- matrix(runif(100), ncol=10, nrow=10)
year.to.october <- data.frame(one.to.ten, month.name[1:10])
myList <- list(ls.obj.1 = one.to.ten, 
               ls.obj.2 = uniform.mat, 
               ls.obj.3 = year.to.october)
names(myList)
## [1] "ls.obj.1" "ls.obj.2" "ls.obj.3"

Lists

myList$ls.obj.1
##  [1]  1  2  3  4  5  6  7  8  9 10
 myList[[1]] 
##  [1]  1  2  3  4  5  6  7  8  9 10

Indexing data frames and matrices

object[rows, colums]

e[1,2]
## [1] 6
e[1,]
## [1] 1 6
patients[1,2]
## [1] "Jones"
patients[1,]
##   First_Name Second_Name  Full_Name
## 1       Adam       Jones Adam Jones
##    Sex Age Weight Consent
## 1 Male  50   70.8    TRUE

Advanced indexing

 s <- letters[1:5]
 s[c(1,3)]
## [1] "a" "c"
 s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] "a" "c"
a <- 1:5
a < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE
s[a < 3]
## [1] "a" "b"
s[a > 1 & a <3]
## [1] "b"
s[a == 2]
## [1] "b"

Operators

Exercise

source("1.2_patients.R")

Logical indexing answers

patients[patients$Age < 40,]
patients[patients$Consent == TRUE,]
patients[patients$Sex == "Male" & patients$Weight >= 70.8,]

Basic R ‘Built-in’ functions for working with data frames

names(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"
colnames(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"
rownames(patients)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7" 
##  [8] "8"  "9"  "10"

Basic R ‘Built-in’ functions for working with data frames

nrow(patients)
## [1] 10
ncol(patients)
## [1] 7
dim(patients)
## [1] 10  7

Basic R ‘Built-in’ functions for working with data frames

length(c(1,2,3,4,5))
## [1] 5
length(patients)
## [1] 7
length(patients$Age)
## [1] 10

Remember, a data frame is a list of variables, so its length is the number of variables. The length of one of the variable vectors (like Age) is the number of observations

Basic R ‘Built-in’ functions for working with data frames

newpatient <- c("Kate", "Lawson", "Kate Lawson", "Female", "35", "62.5","TRUE")
rbind(patients,newpatient)
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent
## 1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
## 3        John       Evans     John Evans   Male  35   75.3   FALSE
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE
## 11       Kate      Lawson    Kate Lawson Female  35   62.5    TRUE

Basic R ‘Built-in’ functions for working with data frames

cbind(patients,10:1)
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent 10:1
## 1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE   10
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE    9
## 3        John       Evans     John Evans   Male  35   75.3   FALSE    8
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE    7
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE    6
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE    5
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE    4
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE    3
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE    2
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE    1

Basic R ‘Built-in’ functions for working with data frames

patients[-1, ] # remove first row
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
## 3        John       Evans     John Evans   Male  35   75.3   FALSE
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE

Basic R ‘Built-in’ functions for working with data frames

patients[ ,-1] # remove first column
##    Second_Name      Full_Name    Sex Age Weight Consent
## 1        Jones     Adam Jones   Male  50   70.8    TRUE
## 2       Parker     Eve Parker Female  21   67.9    TRUE
## 3        Evans     John Evans   Male  35   75.3   FALSE
## 4        Davis     Mary Davis Female  45   61.9    TRUE
## 5        Baker    Peter Baker   Male  28   72.4   FALSE
## 6      Daniels   Paul Daniels   Male  31   69.9   FALSE
## 7      Edwards Joanna Edwards Female  42   63.5   FALSE
## 8        Smith  Matthew Smith   Male  33   71.5    TRUE
## 9      Roberts  David Roberts   Male  57   73.2   FALSE
## 10      Wilson   Sally Wilson Female  62   64.8    TRUE

Basic R ‘Built-in’ functions for working with data frames

sort(patients$Second_Name)
##  [1] "Baker"   "Daniels" "Davis"   "Edwards" "Evans"   "Jones"   "Parker" 
##  [8] "Roberts" "Smith"   "Wilson"
order(patients$Second_Name)
##  [1]  5  6  4  7  3  1  2  9  8 10
patients[order(patients$Second_Name),]
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
## 3        John       Evans     John Evans   Male  35   75.3   FALSE
## 1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE

Exercise

Answers

source("1.2_patients.R")
patients[order(patients$Age,decreasing = TRUE),]
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE
## 1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
## 3        John       Evans     John Evans   Male  35   75.3   FALSE
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
patients2 <- cbind(patients, Height = c(1.8, 1.6, 1.9, 1.5, 1.7, 1.5,1.75,1.8,1.6,1.5))
patients2
##    First_Name Second_Name      Full_Name    Sex Age Weight Consent Height
## 1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE   1.80
## 2         Eve      Parker     Eve Parker Female  21   67.9    TRUE   1.60
## 3        John       Evans     John Evans   Male  35   75.3   FALSE   1.90
## 4        Mary       Davis     Mary Davis Female  45   61.9    TRUE   1.50
## 5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE   1.70
## 6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE   1.50
## 7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE   1.75
## 8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE   1.80
## 9       David     Roberts  David Roberts   Male  57   73.2   FALSE   1.60
## 10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE   1.50

Answers

bmi <- patients2$Weight / (patients2$Height)^2
bmi
##  [1] 21.85185 26.52344 20.85873 27.51111 25.05190 31.06667 20.73469 22.06790
##  [9] 28.59375 28.80000
patients2[bmi > 25 & patients2$Consent,]
##    First_Name Second_Name    Full_Name    Sex Age Weight Consent Height
## 2         Eve      Parker   Eve Parker Female  21   67.9    TRUE    1.6
## 4        Mary       Davis   Mary Davis Female  45   61.9    TRUE    1.5
## 10      Sally      Wilson Sally Wilson Female  62   64.8    TRUE    1.5

3. R for data analysis

3 steps to Basic Data Analysis

  1. Reading in data
    • read.table()
    • read.csv(), read.delim()
  2. Analysis
    • Manipulating & reshaping the data
    • Any maths you like
    • Plotting the outcome
  3. Writing out results
    • write.table()
    • write.csv()

A simple walkthrough

1. Read in the data

rawData <- read.delim("1.3_NBcountData.txt")
rawData[1:10,]
read.csv("1.3_NBcountData.csv")
?read.table

1. Read in the data

rawData[1:10,]  # View the first 10 rows to ensure import is OK
##    Patient Nuclei NB_Amp NB_Nor NB_Del
## 1        1     65      0     63      2
## 2        2     51      3     43      5
## 3        3     37      2     35      0
## 4        4     37      2     35      0
## 5        5     45      2     42      1
## 6        6     46      4     41      1
## 7        7     65      1     64      0
## 8        8     59      1     54      4
## 9        9     49      0     48      1
## 10      10     46      0     45      1

Word of caution

tolstoy

hadley

Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)

http://vimeo.com/33727555

http://vita.had.co.nz/papers/tidy-data.pdf

Handling missing values

x <- c(1,NA,3)
mean(x)
## [1] NA
mean(x, na.rm=TRUE)
## [1] 2
mean(na.omit(x))
## [1] 2
 is.na(x)
## [1] FALSE  TRUE FALSE

2. Analysis (reshaping data and maths)

# create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei 
# Get sample names of amplified patients:
amp <- which(prop > 0.33) 
plot(prop, ylim=c(0,1))
abline(h=0.33) # Add a horizonal line

3. Outputting the results

write.csv(rawData[amp,], file="selectedSamples.csv")
getwd() # print working directory
list.files() # list files in working directory

Data analysis exercise: Which samples are near normal?

Solution to NB normality test

norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
##  [1]  3  4  7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")

4. Plotting in R

Plot basics

Making a Scatter Plot

patients$Weight
##  [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5 71.5 73.2 64.8
plot(patients$Weight)

Making a Scatter Plot

R tries to guess the most appropriate way to visualise the data

Making a Scatter Plot of two variables

patients$Age
##  [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Weight,patients$Age)

Making a barplot

barplot(patients$Age)

Making a barplot

barplot(summary(patients$Sex))

Plotting a distribution: Histogram

hist(patients$Weight)

Plotting a distribution: Histogram

hist(patients$Age)

Plotting a distribution: Boxplot

boxplot(patients$Weight)

Plotting a distribution: Boxplot

boxplot(patients$Weight~patients$Sex)

Exercise: Data exploration

Suggestions

weather <- read.csv("ozone.csv")
head(weather)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Suggestions

plot(weather$Ozone,weather$Solar.R)

Suggestions

hist(weather$Temp)

Suggestions

boxplot(weather$Ozone)

Suggestions

boxplot(weather$Ozone~weather$Month)

Simple customisations

plot(patients$Weight,type="l")

Simple customisations

plot(patients$Weight,type="b")

Simple customisations

plot(patients$Weight,patients$Age,main="Relationship between Weight and Age")

Simple customisations

plot(patients$Weight,patients$Age,xlab="Weight")

Simple customisations

plot(patients$Weight,patients$Age,ylab="Age")

Simple customisations

plot(patients$Weight,patients$Age,
     ylab="Age",
     xlab="Weight",
     main="Relationship between Weight and Age",
     ylim=c(10,70),
     xlim=c(60,80))

Defining a colour

Use of colours

Changing the col argument to plot changes the colour that the points are plotted in

plot(patients$Weight,patients$Age,col="red")

Plotting characters

plot(patients$Weight,patients$Age, pch=16)

Plotting characters

Plotting characters

plot(patients$Weight,patients$Age, pch="X")

Size of points

Character expansion

plot(patients$Weight,patients$Age, cex=2)

Size of points

Character expansion

plot(patients$Weight,patients$Age, cex=0.2)

Colours and characters as vectors

plot(patients$Weight,patients$Age, 
     pch = 1:10,
     col=c("red","orange","green","blue"),
     cex = 1:5)

Other plots use the same arguments

boxplot(patients$Weight~patients$Sex,
        xlab="Sex",
        ylab="Weight",
        main="Relationship between Weight and Gender",
        col=c("blue","yellow"))

Exercise

End of Day 1